Abstract: We consider unsupervised crowdsourcing performance
based on the model given in [13], wherein the responses of
end-users are essentially rated according to how their responses 
correlate with the majority of other responses to the same 
subtasks/questions. In one setting, we consider an independent 
sequence of identically distributed crowdsourcing assignments 
(meta-tasks), while in the other we consider a single assignment 
with a large number of component subtasks. Both problems yield 
intuitive results in which the overall reliability of the crowd is a 
factor. 

Index Terms — Crowdsourcing, unsupervised learning, consen- 
sus, design, performance, error rate. 



I. Introduction 

On-line crowdsourcing addresses the problem of solving a 
large meta-task by decomposing it into a large number of small 
tasks/questions and assigning them to an online community 
of peers/users. Examples of decomposable meta-tasks include:

• annotating (including recommending) or classifying a 
large number of consumer products and services, or data 
objects such as documents [12], web sites (e.g., answering
which among a large body of URLs contains pornogra- 
phy), images, videos; 

• translating or transcribing a document |i6J possibly in- 
cluding decoding a body of CAPTCHAs ifTTl : 

• document correction through proofreading jS], pOl; and 

• creating and maintaining content, e.g., Wikipedia and 
open-source communities. 

General purpose platforms for on-line crowdsourcing include
Amazon's Mechanical Turk [1], [12], [14] and CrowdFlower.

Users responding to questions may do so with different 
degrees of reliability. If p is the probability that a user 
correctly answers a question, let the expectation Ep be taken 
over the ensemble of users. Thus, Ep is a measure of the 
reliability of the majority and, fundamentally, whether the 
positive correlation with the majority ought to be sought 
for individual users (as is typically assumed in many online 
unsupervised "polling" systems). 

A user population is arguably reliable (Ep > 0.5) when the 
population itself ultimately decides the issue (e.g., confidence
intervals for an election poll), or the questions concern a 
commonplace issue with commonplace expertise among the 
population (e.g., whether a web site contains pornography), 
or the population is significantly financially incentivized to be 
accurate (i.e., incentivized to acquire the required expertise
to be accurate). Some market-based crowdsourcing scenarios
(e.g., questions of investing in stocks of complex companies), 
or analogies to bookmaking (setting odds so that the house 
always profits), may not be relevant here, i.e., scenarios 
where questions are pushed to users who minimally profit 
by answering them correctly. That is, for some specialized 
technical issues, it may be possible that the "crowd" will 
be unhelpful (Ep ≈ 0.5) or incorrectly prejudiced/biased
(Ep < 0.5). In many cases, the users may need to be
paid for questions answered [19]. Thus, the crowdsourcer is
incentivized to determine the reliability of individual users in 
a scalable fashion. 

This paper is organized as follows. The iterative, unsupervised
framework and assumptions of [13] are reviewed in Section II. In
Section III we find expressions for the mean and variance
of the parameters (y) used to weight user answers after one
iteration, under certain assumptions related to the regularity
of connectivity of the bipartite graph matching users to
questions/subtasks. We also state the existence of a fixed
point for a normalized version of the user-weights iteration.
To derive an asymptotic result, in Section IV we consider the
user-weight iteration spanning a sequence of independent and
identically distributed (i.i.d.) meta-tasks, with one iteration per
meta-task. We give the results of a numerical study for the original
system (multiple iterations for a single meta-task) in Section V.
In Section VI, we describe how the crowdsourcing system of [13] is
related to LDPC decoding. Finally, we conclude with a summary in
Section VII.



II. Model Background 

In [13], a single meta-task is divided into a group of $|Q|$
similar subtasks/questions $i \in Q$ for which the true Boolean
answers are encoded $z_i \in \{-1, 1\}$. These questions are
assigned to a group of $|U|$ users $a \in U$. If $a$ is assigned
question $i$, then his/her answer is $A_{ia} \in \{-1, 1\}$. Again, the
questions $i$ are assumed similar, so we model user $a$ with a
task-independent parameter $p_a$ which reflects the reliability of
the user's answer: for all $i$,

$$P(A_{ia} = z_i) = p_a \quad\text{and}\quad P(A_{ia} = -z_i) = 1 - p_a,$$

so that

$$\mathrm{E}A_{ia} = z_i p_a - z_i(1 - p_a) = z_i(2p_a - 1) \quad\text{and}$$

$$\mathrm{var}(A_{ia}) = 1 - (2p_a - 1)^2 = 4p_a(1 - p_a).$$
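The two moments above follow directly from the two-point distribution of a single answer. As a quick check, the following Python sketch computes them from the probability mass function and compares them with the closed forms $z_i(2p_a - 1)$ and $4p_a(1 - p_a)$. The function name `answer_moments` and the numerical values are illustrative assumptions, not part of [13].

```python
import math


def answer_moments(z: int, p: float) -> tuple[float, float]:
    """Exact mean and variance of an answer A in {-1, +1} with P(A = z) = p."""
    outcomes = [(z, p), (-z, 1.0 - p)]            # (value, probability) pairs
    mean = sum(a * prob for a, prob in outcomes)
    second_moment = sum(a * a * prob for a, prob in outcomes)
    return mean, second_moment - mean ** 2


# Illustrative values (assumed): true answer z = -1, reliability p = 0.8.
mean, var = answer_moments(z=-1, p=0.8)
assert math.isclose(mean, -1 * (2 * 0.8 - 1))     # z (2p - 1)
assert math.isclose(var, 4 * 0.8 * (1 - 0.8))     # 4 p (1 - p)
```

Note that the variance is largest at $p_a = 0.5$, where the answer carries no information about $z_i$.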



Suppose that the response to question $i$ is determined by the
crowdsourcer as

$$\hat{z}_i = \mathrm{sgn}\Big(\sum_{b \in \partial i} A_{ib}\, y_{b \to i}\Big), \qquad (1)$$

where $\partial i \subset U$ is the group of users assigned to question $i$ and
$y_{b \to i}$ is the weight given to user $b$ for question $i$.

If $y_{b \to i}$ is the same positive constant for all $b, i$, then the
crowdsourcer is simply taking a majority vote without any
knowledge of the reliability of the peers.
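A weighted vote of this kind can be sketched in a few lines of Python. The function name `decide`, the user labels and the weights below are hypothetical; returning 0 on an exact tie is a choice made for this sketch only.

```python
def decide(answers: dict[str, int], weights: dict[str, float]) -> int:
    """Sign of the weighted sum of answers; 0 signals an indeterminate tie."""
    total = sum(answers[b] * weights[b] for b in answers)
    return (total > 0) - (total < 0)


answers = {"u1": 1, "u2": -1, "u3": -1}           # hypothetical users and answers
uniform = {b: 1.0 for b in answers}               # plain majority vote
skewed = {"u1": 3.0, "u2": 1.0, "u3": 1.0}        # u1 judged far more reliable

print(decide(answers, uniform))  # -1: the majority wins
print(decide(answers, skewed))   # 1: the heavily weighted user wins
```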

One approach to determining weights $y$ is to assess how
each user $a$ performs with respect to the majority of those
assigned to the same question $i$. The presumption is that the
majority will tend to be correct on average. Given that, how
can the crowdsourcer identify the unreliable users/respondents
so as to avoid them for subsequent tasks? Accordingly, a
different weight $y_{a \to i}$ can be iteratively determined for each
user $a$'s response to every question $i$ in the following way
[13]:

• Initialize i.i.d. $y^{(0)}_{a \to i} \sim N(1, 1)$, i.e., initially assume each
user is roughly reliable with $\mathrm{E}y^{(0)}_{a \to i} = 1$ and
$P(y^{(0)}_{a \to i} > 0) \approx 0.84$.

• For step $k \ge 1$:

$k$.1: $x^{(k)}_{j \to a} = \sum_{b \in \partial j \setminus a} A_{jb}\, y^{(k-1)}_{b \to j}$, i.e., consider the
weighted answer to question $j$ not including user $a$'s
response.

$k$.2: $y^{(k)}_{a \to i} = \sum_{j \in \partial^{-1} a \setminus i} A_{ja}\, x^{(k)}_{j \to a}$, i.e., correlate the
responses of the other users with those of user $a$
over all questions assigned to $a$ except $i$.

Here $\partial^{-1} a$ is the set of questions assigned to user $a$. The
distribution of $y^{(k)}_{a \to i}$ as a function of iteration $k$ is studied in
[13] for degree-regular assignment of questions to users. Note
that by simply eliminating $x^{(k)}_{j \to a}$ we can write

$$y^{(k)}_{a \to i} = \sum_{j \in \partial^{-1} a \setminus i} A_{ja} \sum_{b \in \partial j \setminus a} A_{jb}\, y^{(k-1)}_{b \to j}. \qquad (2)$$

So, $y^{(k)}_{a \to i}$ depends on the responses of $a$'s one-hop neighbors
in $U$,

$$N_{a \to i} := \{ b \in U \mid \exists j \in \partial^{-1} a \setminus i \text{ s.t. } b \in \partial j \setminus a \},$$

i.e., not including $a$ itself or any one-hop neighbors of $a$ (in
$U$) also assigned to question $i$.
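The two-step update can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [13]: the function name `iterate_weights`, the dictionary-based data layout and the optional `y0` argument (used here to make the toy example deterministic) are all choices made for this sketch.

```python
import random


def iterate_weights(assign, answers, n_iter, y0=None, seed=0):
    """Run n_iter steps of the weight iteration on a question -> users assignment.

    assign:  dict mapping question j to the list of users assigned to it
    answers: dict mapping (j, b) to user b's answer (+1 or -1) to question j
    y0:      optional dict mapping (a, i) to an initial weight; defaults to
             i.i.d. N(1, 1) draws
    Returns a dict mapping (a, i) to the weight of user a for question i.
    """
    rng = random.Random(seed)
    questions_of = {}
    for j, users in assign.items():
        for a in users:
            questions_of.setdefault(a, []).append(j)
    edges = [(a, i) for i, users in assign.items() for a in users]
    y = dict(y0) if y0 is not None else {e: rng.gauss(1.0, 1.0) for e in edges}
    for _ in range(n_iter):
        # Step k.1: weighted answer to question j, leaving out user a.
        x = {(j, a): sum(answers[j, b] * y[b, j] for b in assign[j] if b != a)
             for (a, j) in edges}
        # Step k.2: correlate user a's answers with x over a's questions except i.
        y = {(a, i): sum(answers[j, a] * x[j, a] for j in questions_of[a] if j != i)
             for (a, i) in edges}
    return y


# Toy example (assumed): 3 questions, each assigned to all 3 users; users 0 and 1
# always answer correctly, user 2 always answers incorrectly.
z = {0: 1, 1: -1, 2: 1}
assign = {j: [0, 1, 2] for j in z}
answers = {(j, b): (z[j] if b != 2 else -z[j]) for j in z for b in assign[j]}
ones = {(a, i): 1.0 for i in assign for a in assign[i]}
y = iterate_weights(assign, answers, n_iter=1, y0=ones)
print(y[2, 0], y[0, 0])  # the always-wrong user receives a negative weight
```

In this toy case the two reliable users cancel against the unreliable one in step k.1 as seen by each other, so only the unreliable user's weight moves away from zero after one step.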

III. Distribution of User Weights After One Iteration

A. Degree-regular graph 

For the degree-regular assignment, we can relate the number
of users per question $r := |\partial i|\ \forall i$ to the number of questions
per user $s := |\partial^{-1} a|\ \forall a$:

$$r|Q| = s|U|.$$

In the following, we will assume

$$r > 2 \quad\text{and}\quad s > 2.$$

Furthermore, we may assume that all sets $N_{a \to i}$ have the same
size $N$, where generally

$$N \le (r - 1)(s - 1) \text{ members},$$

but that the number of terms summed in (2) is always equal
to $(r - 1)(s - 1)$.

To form such degree-regular assignments, one can simply
iterate over the (enumerated) questions:

0. $i = 1$ (first question $\in \{1, 2, \ldots, |Q|\}$).

1. assign $i$ to $r$ different users $\in U$ chosen uniformly at
random.

2. $\forall a \in U$ such that $|\partial^{-1} a| = s$, $U \leftarrow U \setminus \{a\}$.

3. if $i < |Q|$, $i \to i + 1$ and go to step 1.

Note that since $r|Q| = s|U|$, the questions will be exhausted
just when the users are (i.e., when $i \to |Q|$, $U \to \emptyset$).
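The steps above can be sketched in Python as follows. One detail is an assumption of this sketch rather than something stated in the text: if fewer than $r$ users with spare capacity remain before the questions are exhausted, the attempt is discarded and the procedure restarts. The function name `regular_assignment` and its parameters are likewise hypothetical.

```python
import random


def regular_assignment(n_questions, n_users, r, s, seed=0, max_tries=10_000):
    """Assign each question to r users so that every user gets s questions.

    Questions are processed in order; each is given to r distinct users drawn
    uniformly at random among those with fewer than s questions so far.
    Returns a dict mapping question index to its list of users.
    """
    if r * n_questions != s * n_users:
        raise ValueError("need r*|Q| == s*|U| for a degree-regular assignment")
    rng = random.Random(seed)
    for _ in range(max_tries):
        load = [0] * n_users                      # questions assigned so far
        available = list(range(n_users))          # users with load < s
        assign = {}
        for i in range(n_questions):
            if len(available) < r:
                break                             # stuck: restart (assumption)
            assign[i] = rng.sample(available, r)
            for a in assign[i]:
                load[a] += 1
            available = [a for a in available if load[a] < s]
        else:
            return assign                         # every question was assigned
    raise RuntimeError("no valid assignment found; increase max_tries")


assign = regular_assignment(n_questions=12, n_users=12, r=3, s=3)
print(sorted(len(users) for users in assign.values()))
```

When the loop completes, exactly $r|Q| = s|U|$ user slots have been filled with no user exceeding $s$, so every user holds exactly $s$ questions.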

B. First-iteration variance and mean of user weights, y 

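Let $O_{j,a} := \sum_{b \in \partial j \setminus a} A_{jb}\, y^{(0)}_{b \to j}$, and note that $O_{j,a}$ is
independent of $O_{j',a}$ for all $a$ and $j \ne j'$. So, if $\mu^{(0)} := \mathrm{E}y^{(0)}_{b \to j} = 1$,
the mean of $y^{(1)}_{a \to i}$ is

$$\mu^{(1)}_{a \to i} := \mathrm{E}y^{(1)}_{a \to i} = \sum_{j \in \partial^{-1} a \setminus i} \mathrm{E}A_{ja}\, \mathrm{E}O_{j,a} \quad \text{(by indep.)}$$

$$= \sum_{j \in \partial^{-1} a \setminus i} z_j(2p_a - 1) \sum_{b \in \partial j \setminus a} z_j(2p_b - 1)$$

$$= (2p_a - 1) \sum_{j \in \partial^{-1} a \setminus i}\ \sum_{b \in \partial j \setminus a} (2p_b - 1)$$

(since $z_j^2 = 1$ a.s.). Also, assume the variance $\mathrm{var}^{(0)} :=
\mathrm{var}(y^{(0)}_{b \to j}) = 1$ ($\Rightarrow \mathrm{E}(y^{(0)}_{b \to j})^2 = 2$). So,

$$\mathrm{var}^{(1)}_{a \to i} = \sum_{j \in \partial^{-1} a \setminus i} \mathrm{var}(A_{ja} O_{j,a}) \quad \text{(by indep.)}$$

$$= \sum_{j \in \partial^{-1} a \setminus i} \big[\mathrm{E}(O_{j,a})^2 - (2p_a - 1)^2 (\mathrm{E}O_{j,a})^2\big]$$

$$= \sum_{j \in \partial^{-1} a \setminus i} \big[\mathrm{var}(O_{j,a}) + (1 - (2p_a - 1)^2)(\mathrm{E}O_{j,a})^2\big]$$

$$= \sum_{j \in \partial^{-1} a \setminus i} \Big[\Big\{\sum_{b \in \partial j \setminus a} \mathrm{var}(A_{jb}\, y^{(0)}_{b \to j})\Big\} + (1 - (2p_a - 1)^2)(\mathrm{E}O_{j,a})^2\Big] \quad \text{(by indep.)} \qquad (3)$$

$$= \sum_{j \in \partial^{-1} a \setminus i} \Big[\Big\{\sum_{b \in \partial j \setminus a} (2 - (2p_b - 1)^2)\Big\} + (1 - (2p_a - 1)^2)\Big(\sum_{b \in \partial j \setminus a} (2p_b - 1)\Big)^2\Big]. \qquad (4)$$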
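The first-iteration mean and the variance expression (4) can be checked by simulation in the special case where every user has the same reliability $p$, so that all sums over $b$ have identical terms. The Python sketch below is only a sanity check under that assumption; the function name, the values $r = s = 3$, $p = 0.8$ and the number of trials are illustrative.

```python
import random
import statistics


def first_iteration_samples(r, s, p, n_trials, seed=0):
    """Sample the first-iteration weight when every user has reliability p.

    The s - 1 questions other than i are each shared with r - 1 other users.
    All true answers are taken as +1, which does not change the law of the
    weight since each true answer enters squared. Initial weights are
    i.i.d. N(1, 1).
    """
    rng = random.Random(seed)

    def answer():
        return 1 if rng.random() < p else -1

    samples = []
    for _ in range(n_trials):
        y = 0.0
        for _ in range(s - 1):
            # Weighted answer of the r - 1 other users to this question.
            o = sum(answer() * rng.gauss(1.0, 1.0) for _ in range(r - 1))
            y += answer() * o                      # correlate with user a
        samples.append(y)
    return samples


r, s, p = 3, 3, 0.8                                # illustrative values (assumed)
q = 2 * p - 1
samples = first_iteration_samples(r, s, p, n_trials=20_000)
mean_formula = (s - 1) * (r - 1) * q * q
var_formula = (s - 1) * ((r - 1) * (2 - q * q) + (1 - q * q) * ((r - 1) * q) ** 2)
print(statistics.fmean(samples), mean_formula)
print(statistics.variance(samples), var_formula)
```

The sample mean and variance should land close to the two formula values, up to Monte Carlo noise.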

C. Assumption of large number of users per question, r 

Finally, for simplicity, we may additionally assume sufficiently
large $r$ (number of users per question) and that the neighbor
selection is uniformly distributed so that, for all $a, j$,

$$\frac{1}{r - 1} \sum_{b \in \partial j \setminus a} (2p_b - 1) \approx \mathrm{E}(2p - 1) = 2\mathrm{E}p - 1. \qquad (5)$$

The following lemma is now obtained simply by substitu- 
tion. 

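Lemma 1. For (2) under (5), for all $a, i$:

$$\mu^{(1)}_{a \to i} \approx (s - 1)(r - 1)(2p_a - 1)(2\mathrm{E}p - 1) \qquad (6)$$

and

$$\mathrm{var}^{(1)}_{a \to i} \approx (s - 1)(r - 1)\big[\mathrm{E}(2 - (2p - 1)^2) + (1 - (2p_a - 1)^2)(r - 1)(2\mathrm{E}p - 1)^2\big]. \qquad (7)$$

Though the $|U| \times |Q|$ matrix $Y^{(k)}$ with elements $Y_{a,i} := y^{(k)}_{a \to i}$
is Markovian, directly proceeding along these lines for $\mathrm{var}^{(k)}$,
$k \ge 2$, is complicated by the dependence of the terms involved
through the structure of the bipartite graph mapping users $U$
to questions $Q$.

D. Discussion: Normalized weights

It's possible that the weights $y^{(k)}$ may be unbounded in $k$.
Instead of (2), for a degree-regular assignment suppose the
weights are, for all $a, i$,

$$\tilde{y}^{(k)}_{a \to i} = \frac{1}{(s - 1)(r - 1)} \sum_{j \in \partial^{-1} a \setminus i} A_{ja} \sum_{b \in \partial j \setminus a} A_{jb}\, \tilde{y}^{(k-1)}_{b \to j}. \qquad (8)$$

Let $\tilde{Y}^{(k)}$ be the $|U| \times |Q|$-matrix with elements $\tilde{Y}_{a,i} = \tilde{y}^{(k)}_{a \to i}$.

Proposition 1. For (8), if $\tilde{Y}^{(0)} \in [-1, 1]^{|U| \times |Q|}$, then the
sequence $\tilde{Y}^{(k)}$ has a fixed point in $[-1, 1]^{|U| \times |Q|}$.

Proof: Simply by the triangle inequality and induction, if
$\tilde{Y}^{(0)} \in [-1, 1]^{|U| \times |Q|}$ then $\tilde{Y}^{(k)} \in [-1, 1]^{|U| \times |Q|}$ for all $k$. As
the mapping (8) is continuous, we can apply Brouwer's fixed
point theorem to get existence. □

Note that, generally, fixed points of a continuous linear
operator on a bounded domain needn't be unique.

IV. A Series of Similar Meta-Tasks with One Iteration per Meta-Task

Let

$$S := (s - 1)(r - 1) > 1 \quad\text{and}\quad \phi := \mathrm{E}(2p - 1)^2 \in [0, 1].$$

We now consider a series of similar meta-tasks indexed $k$ and a
single iteration as (2) for each on its component questions (all
questions similar to each other too). Moreover, each meta-task
will reassign the component questions using an independently
sampled degree-regular assignment. Obviously, the answers $A$
will be independently resampled too. That is, here

$$\tilde{y}^{(k)}_{a \to i} = \frac{1}{S} \sum_{j \in \partial^{-1} a^{(k)} \setminus i}\ \sum_{b \in \partial j^{(k)} \setminus a} A^{(k)}_{ja} A^{(k)}_{jb}\, \tilde{y}^{(k-1)}_{b \to j}, \qquad (9)$$

where now the $A^{(k)}$ and $\tilde{y}^{(k-1)}$ terms are independent. The
following asymptotic analysis is facilitated by this assumption
on successive i.i.d. meta-tasks.

Lemma 2. If (5) and $\mu^{(0)}_{a \to i} = 1$ for all $a, i$, then for all $k \ge 1$,

$$\mu^{(k)}_{a \to i} \approx (2p_a - 1)(2\mathrm{E}p - 1)\phi^{k-1}. \qquad (10)$$

Proof: First note that by the argument for (6) and definition
(9), (10) holds for $k = 1$. The lemma is simply proven
by induction. □

Lemma 3. If (5), $\mu^{(0)}_{a \to i} = 1$ and $\mathrm{var}^{(0)}_{a \to i} = v_0$ for all $a, i$,
then for $k \ge 1$

$$\mathrm{var}^{(k)} \le v_0 S^{-k} + \frac{r(2\mathrm{E}p - 1)^2}{S\phi} \cdot \frac{\phi^{2k} - S^{-k}}{\phi^2 - S^{-1}}. \qquad (11)$$

Proof: Proceeding as for (3) and using the independence
afforded by (9) gives

$$\mathrm{var}^{(k)}_{a \to i} = \frac{1}{S^2} \sum_{j \in \partial^{-1} a^{(k)} \setminus i} \Big[\Big\{\sum_{b \in \partial j^{(k)} \setminus a} \mathrm{var}(A^{(k)}_{jb}\, \tilde{y}^{(k-1)}_{b \to j})\Big\} + (1 - (2p_a - 1)^2)\Big\{\sum_{b \in \partial j^{(k)} \setminus a} (2p_b - 1)\mu^{(k-1)}_{b \to j}\Big\}^2\Big]$$

$$= \frac{1}{S^2} \sum_{j} \Big[\Big\{\sum_{b} \mathrm{E}(\tilde{y}^{(k-1)}_{b \to j})^2 - (2p_b - 1)^2(\mu^{(k-1)}_{b \to j})^2\Big\} + (1 - (2p_a - 1)^2)\Big\{\sum_{b} (2p_b - 1)\mu^{(k-1)}_{b \to j}\Big\}^2\Big]$$

$$= \frac{1}{S^2} \sum_{j} \Big[\Big\{\sum_{b} \mathrm{var}^{(k-1)}_{b \to j} + (1 - (2p_b - 1)^2)(\mu^{(k-1)}_{b \to j})^2\Big\} + (1 - (2p_a - 1)^2)\Big\{\sum_{b} (2p_b - 1)\mu^{(k-1)}_{b \to j}\Big\}^2\Big]$$

$$\le \frac{1}{S^2} \sum_{j} \Big[\Big\{\sum_{b} \mathrm{var}^{(k-1)}_{b \to j} + (\mu^{(k-1)}_{b \to j})^2\Big\} + \Big\{\sum_{b} (2p_b - 1)\mu^{(k-1)}_{b \to j}\Big\}^2\Big],$$

where the sums are over $j \in \partial^{-1} a^{(k)} \setminus i$ and $b \in \partial j^{(k)} \setminus a$.
Thus, using (10),

$$\mathrm{var}^{(k)}_{a \to i} \le \frac{1}{S^2} \sum_{j} \sum_{b} \mathrm{var}^{(k-1)}_{b \to j} + \frac{(2\mathrm{E}p - 1)^2 \phi^{2k-3}}{S} + \frac{(r - 1)(2\mathrm{E}p - 1)^2 \phi^{2k-2}}{S}$$

$$\le \frac{1}{S^2} \sum_{j} \sum_{b} \mathrm{var}^{(k-1)}_{b \to j} + \frac{r(2\mathrm{E}p - 1)^2 \phi^{2k-3}}{S}.$$

The proof then follows by induction, i.e., dropping dependence
on $a, i$ the previous display is

$$\mathrm{var}^{(k)} \le \frac{1}{S}\mathrm{var}^{(k-1)} + \frac{1}{S} r(2\mathrm{E}p - 1)^2 \phi^{2k-3}.$$

□

By direct substitution, we arrive at the following.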



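Proposition 2. For (9) under (5), if

$$\frac{1}{(s - 1)(r - 1)} < [\mathrm{E}(2p - 1)^2]^2 < 1, \qquad (12)$$

then for all $a, i$ with $p_a \ne 0.5$,

$$\frac{\sqrt{\mathrm{var}^{(k)}_{a \to i}}}{|\mu^{(k)}_{a \to i}|} = O(1),$$

i.e., the relative error is bounded as $k \to \infty$.

Note that (12) is just $S^{-1} < \phi^2 < 1$.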

In (10), the product $(2p_a - 1)(2\mathrm{E}p - 1)$ determines the sign
of $\mu^{(k)}_{a \to i}$ (corresponding to the weight of user $a$ for question
$i$). So, if the crowd tends to be correct (Ep > 0.5) then (as
expected): if a particular user $a$ tends to be correct ($p_a > 0.5$)
then the sign of $a$'s weight $\tilde{y}_{a \to i}$ will tend to be positive, else
negative.

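Revisiting (1), we can instead estimate the answer to question
$i$ of task $k$ using normalized weights as

$$\hat{z}^{(k)}_i = \mathrm{sgn}\Big(\sum_{b \in \partial i} A^{(k)}_{ib}\, \tilde{y}^{(k)}_{b \to i}\Big).$$

An immediate consequence of the previous proposition is the
following (noting $\phi$ in (10) is positive).

Corollary 1. If the limiting relative error is sufficiently small,
then

$$\hat{z}^{(k)}_i \to \mathrm{sgn}\Big((2\mathrm{E}p - 1) \sum_{b \in \partial i} A_{ib}(2p_b - 1)\Big) \qquad (13)$$

as $k \to \infty$.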



This expression is intuitively pleasing, as the individual
user $b$'s responses $A_{ib}$ are weighted by their own
reliabilities $(2p_b - 1)$, essentially learned through the iterations.
Also, the overall reliability of the crowd, $\mathrm{E}(2p - 1) = 2\mathrm{E}p - 1$,
is a factor, so that if the crowd is on average unreliable
($\mathrm{E}(2p - 1) < 0$), the opposite response (sign change) of the
system which favors the majority (the summation) will result.
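The limiting decision rule (13) is easy to evaluate once the reliabilities are known. The Python sketch below is illustrative only: the function name `limiting_decision`, the answers and the reliabilities are hypothetical, and returning 0 on an exact tie is a convention of this sketch.

```python
def limiting_decision(answers, reliabilities, mean_reliability):
    """sgn((2 Ep - 1) * sum_b A_ib (2 p_b - 1)); returns 0 on an exact tie."""
    total = sum(a * (2 * p - 1) for a, p in zip(answers, reliabilities))
    value = (2 * mean_reliability - 1) * total
    return (value > 0) - (value < 0)


# Hypothetical question answered by three users with known reliabilities.
answers = [1, 1, -1]
reliabilities = [0.9, 0.6, 0.2]
print(limiting_decision(answers, reliabilities, mean_reliability=0.6))  # 1
```

Here the third user is wrong more often than not ($p_b = 0.2$), so the negative weight $2p_b - 1 = -0.6$ turns that user's answer of $-1$ into support for $+1$.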

V. Numerical Study 



Considering the term $\phi = \mathrm{E}(2p - 1)^2$ in (12), note that the
crowd is unreliable if Ep is close to 0.5. If $\mathrm{E}(2p - 1) \ne 0$ but
is small, then intuitively the unreliability of the crowd can
be compensated for by sufficient correlation with the majority,
e.g., if (12) holds and a sufficient number of iterations $k$ are
performed; otherwise the weights $Y$ (or $\tilde{Y}$) may be too close
to zero, giving indeterminate decisions (1).

A model of a crowdsourcing assignment with a single meta-task
was simulated using NetLogo [16], a multi-agent simulation
tool. The meta-task was an aggregate of 100 subtasks/questions
to be assigned to a subset of 100 users. Using
the method described in Section III-A, we performed degree-regular
question-to-user assignment. Each user was assigned
10 questions, i.e., $s = r = 10$. We generated random
reliabilities $p_a$ for each user $a$ using a normal distribution with
a known mean Ep and variance $V = \mathrm{E}(p^2) - (\mathrm{E}p)^2$. Using
these reliabilities, random answers were generated by each
user for the questions assigned. The weights $y_{a \to i}$ for each link
between user $a$ and question $i$ were randomly initialized with
a normal distribution with mean and variance equal to 1 as in
[13]. We computed the values of $y_{a \to i}$ for all $a$ and $i$ according
to (2) for $k = 15$ iterations of message passing between the
users and the questions and vice versa. For each value of
Ep and V we repeated the previous step (consisting of 15
iterations of message passing) 50 times, each time generating
new answers $A_{ia}$ for all $i$ and $a$ based on the reliabilities
of the users. Figure 1 is a plot for a (typical) link $(a, i)$ of
$\sqrt{\mathrm{var}_{a \to i}} / \mu_{a \to i}$ versus iteration index ($k$), for different values of Ep
and V, where $\mu_{a \to i}$ and $\mathrm{var}_{a \to i}$ are the sample mean and
sample variance, respectively, of the weights of edge $(a, i)$
for a given iteration. Note that $1/S = 1/81 \approx 0.0123$
and when Ep = 0.5 (i.e., an unreliable group of users),
$[\mathrm{E}(2p - 1)^2]^2 \approx 0.008 < 0.0123$, there is no convergence;
otherwise there is rapid convergence of the relative error to
zero, so that the condition of Corollary 1 is met for this single
meta-task experiment.

¹ With or without the normalizing factor $1/S$.

[Figure 1: relative error versus number of iterations, for
Ep = 0.5, V = 0.2, $[\mathrm{E}(2p - 1)^2]^2 = 0.008$;
Ep = 0.57, V = 0.4, $[\mathrm{E}(2p - 1)^2]^2 = 0.015$;
Ep = 0.6, V = 0.3, $[\mathrm{E}(2p - 1)^2]^2 = 0.02$.]

Fig. 1. Relative error
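The comparison against condition (12) for this experiment is simple arithmetic and can be reproduced as below. The function name `condition_12` is hypothetical; the three values of $[\mathrm{E}(2p - 1)^2]^2$ are the ones quoted with Fig. 1, taken as given rather than recomputed.

```python
def condition_12(phi_squared: float, r: int, s: int) -> bool:
    """True when 1/((s-1)(r-1)) < phi^2 < 1, with phi^2 = [E(2p-1)^2]^2."""
    return 1.0 / ((s - 1) * (r - 1)) < phi_squared < 1.0


r = s = 10
print(round(1 / ((s - 1) * (r - 1)), 4))            # 0.0123
for phi_squared in (0.008, 0.015, 0.02):            # values quoted for Fig. 1
    print(phi_squared, condition_12(phi_squared, r, s))
```

Only the first value (the Ep = 0.5 case) falls below the threshold, which matches the case reported as not converging.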

In our next experiment we varied r, i.e., the number of users
assigned to a question, and observed the average percentage
error (the number of questions with incorrect answers derived
through the weighted majority correlation method). The error
values were steady-state values, i.e., taken once the iterative
calculation of weights $y_{a \to i}$ had converged and we took the weighted
correlation with the majority of users. The average was taken
over 50 different random realizations of the question-to-user
assignment and answers for the given values of r, Ep and V.
Fig. 2 shows the percentage error for different values of Ep as
it decreases with r. Note that for all three pairs of Ep
and V, the value of $[\mathrm{E}(2p - 1)^2]^2$ was well above $1/S$.
We observed an initial increase in the error approximately at
r = 2, 3. A possible explanation for this could be the labeling
of reliable users as unreliable (by lowering the weight $y_{a \to i}$ for
the user) in the light of insufficient samples to obtain correct
correlation.

Fig. 2. Effect of r on the percentage error for different user reliabilities

VI. Similarity with LDPC Decoding via Message Passing

The consensus framework described above has a marked
resemblance to message-passing algorithms of belief propagation
for decoding of low-density parity-check (LDPC) codes.

LDPC codes were introduced in the early sixties [9], but
became popular only recently because of their ability to reach
the Shannon limits on communication rates. LDPC codes
typically have large lengths and rely on message passing over 
sparse bipartite graphs to reach a consensus on the transmitted 
codeword. One can think of leveraging the already mature 
theory of LDPC coding and decoding for the crowdsourcing 
model. For instance, the convergence properties of LDPC 
codes are normally studied under an independence assumption, 
i.e., that the messages received by each node are independent. 
This assumption is true for the first few iterations defined by 
the "girth length" of the codes, which is the length of the 
smallest cycle in the graph. In our model, girth length will 
depend on the question-to-user assignment. 
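Since the girth depends on the question-to-user assignment, it can be measured directly on a given assignment. The Python sketch below uses the standard breadth-first-search approach for unweighted graphs (run a search from every node and record the shortest cycle closed by a non-tree edge). The function name `girth` and the dictionary layout are assumptions of this sketch, and each question is assumed to list distinct users.

```python
from collections import deque


def girth(assign):
    """Length of the shortest cycle in the bipartite question-user graph.

    assign maps each question to a list of distinct users. Returns None if
    the graph has no cycle.
    """
    adj = {}
    for q, users in assign.items():
        for u in users:
            adj.setdefault(("q", q), []).append(("u", u))
            adj.setdefault(("u", u), []).append(("q", q))
    best = None
    for root in adj:
        dist = {root: 0}
        parent = {root: None}
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    parent[w] = v
                    queue.append(w)
                elif parent[v] != w:
                    # Non-tree edge (v, w) closes a cycle through the root's tree.
                    cycle = dist[v] + dist[w] + 1
                    if best is None or cycle < best:
                        best = cycle
    return best


print(girth({0: [0, 1], 1: [0, 1]}))             # 4: two questions share two users
print(girth({0: [0, 1], 1: [1, 2], 2: [2, 0]}))  # 6: a single ring of questions and users
print(girth({0: [0, 1, 2]}))                     # None: no cycle at all
```

A girth of 4 means that two users share two questions, which is the shortest possible cycle in a bipartite graph.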

Consider a model where the weights $y_{a \to i} \in [0, 1]$ because,
instead of (2),

$$y^{(k)}_{a \to i} = \frac{1}{s - 1} \sum_{j \in \partial^{-1} a \setminus i} \mathrm{sgn}^{+}\big(A_{ja}\, x^{(k)}_{j \to a}\big), \qquad x^{(k)}_{j \to a} = \mathrm{sgn}\Big(\sum_{b \in \partial j \setminus a} A_{jb}\, y^{(k-1)}_{b \to j}\Big),$$

where $\mathrm{sgn}^{+}(a) = 1$ if $a > 0$ and $0$ otherwise. Here, $x_{j \to b}$
gives the answer based on the expected value of answers given
by all users $a \in \partial j \setminus b$, and $y_{a \to i}$ is the posterior probability
that $a$ answers $i$ correctly given that we know the correct
answers for all questions $j \in \partial^{-1} a \setminus i$. Alternatively, soft
decisions can be used, e.g., by defining $x_{j \to b}$ as the log-likelihood
of the answer being 0 or 1. The question nodes
compute the user-reliability expectations, while the user nodes
maximize the log-likelihood of the observed answers over their
(estimated) reliabilities. So, this is similar to the
Expectation-Maximization (EM) algorithm [22].

The use of the EM algorithm is natural in this scenario since we
have to estimate a set of decision variables (correct answers)
along with latent variables (the user reliabilities) [8]. Apart
from the user reliabilities, one can also think of considering
difficulty of the tasks as another set of latent variables [23].
The M-step of the EM algorithm poses some computational
challenge, and most of the known work in this area seeks to
find a solution to the maximization step by using software that
uses numerical optimization techniques. From this perspective,
the message-passing framework can be viewed as an alternate
local or distributed optimization technique that takes
node-by-node decisions iteratively. It will be interesting to find the cost
of using such a framework in terms of the loss in optimality.



VII. Summary 

This paper studied iterative, unsupervised crowdsourcing 
frameworks wherein the weights of users' answers are de- 
termined by correlating their responses to the majority. We 
considered the case of a single meta-task and multiple inde- 
pendent meta-tasks, deriving an asymptotic result for the latter. 
Numerical experiments for multiple iterations on a single
meta-task show that the iteration does not converge when
the crowd is unreliable, but converges rapidly otherwise.
Finally, we briefly described how these crowdsourcing
frameworks are related to LDPC decoding and EM.